A Concise Review of Hallucinations in LLMs and their Mitigation

Pulkundwar, Parth, Dhanawade, Vivek, Yadav, Rohit, Sonkar, Minal, Asurlekar, Medha, Rathod, Sarita

arXiv.org Artificial Intelligence

Abstract--Hallucinations pose a serious challenge to language models, casting a long shadow over the promising field of natural language processing. It is therefore crucial to understand the kinds of hallucinations that occur today, their origins, and the ways of reducing them. This document provides a concise, straightforward summary of exactly that, serving as a one-stop resource for a general understanding of hallucinations and how to mitigate them. In today's fast-moving world of Natural Language Processing (NLP), large language models (LLMs) such as GPT and BERT have become the principal agents of change: they can generate human-like text, answer multifaceted questions, and engage in conversation with near-human fluency.


LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

Lops, Andrea, Narducci, Fedelucio, Ragone, Azzurra, Trizio, Michelantonio, Bartolini, Claudio

arXiv.org Artificial Intelligence

Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model (LLM)-generated unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.
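
To make the pipeline concrete, here is a minimal Python sketch of an AgoneTest-style comparison loop; the functions `generate_test_class`, `compiles`, and `mutation_score` are hypothetical placeholders for an LLM call, a compilation check, and a mutation-testing run, not AgoneTest's actual API:

```python
def generate_test_class(llm: str, strategy: str, class_source: str) -> str:
    # Placeholder for a real LLM call that returns JUnit test-class source.
    return f"// test for {class_source} from {llm} ({strategy} prompting)"

def compiles(test_source: str) -> bool:
    # Placeholder for compiling the generated test against the project.
    return True

def mutation_score(test_source: str) -> float:
    # Placeholder for a mutation-testing run over the compiled test.
    return 0.5

def evaluate(llms, strategies, classes_under_test):
    """Compare (LLM, prompting strategy) pairs under one evaluation pipeline."""
    results = {}
    for llm in llms:
        for strategy in strategies:
            scores = []
            for source in classes_under_test:
                test = generate_test_class(llm, strategy, source)
                if compiles(test):  # only compiling tests are assessed further
                    scores.append(mutation_score(test))
            results[(llm, strategy)] = {
                "compile_rate": len(scores) / len(classes_under_test),
                "avg_mutation_score": sum(scores) / len(scores) if scores else 0.0,
            }
    return results

print(evaluate(["model-a"], ["zero-shot", "few-shot"], ["Stack.java"]))
```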


Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data

Xu, Xuhai, Yao, Bingsheng, Dong, Yuanzhe, Gabriel, Saadia, Yu, Hong, Hendler, James, Ghassemi, Marzyeh, Dey, Anind K., Wang, Dakuo

arXiv.org Artificial Intelligence

Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study of LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize important limitations that must be addressed before deployment in real-world mental health settings, such as known racial and gender biases. We highlight the important ethical risks accompanying this line of research.
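
As an illustration of the prompt designs being compared, the sketch below builds zero-shot and few-shot prompts for a binary stress-prediction task; the template, label set, and demonstration posts are invented for illustration and are not the authors' prompts:

```python
# Zero-shot vs. few-shot prompt construction for a stress-prediction task.
# The template and demonstrations are illustrative, not the paper's prompts.
ZERO_SHOT = (
    'Post: "{post}"\n'
    "Question: Does the poster appear to be experiencing stress? "
    "Answer yes or no.\nAnswer:"
)

FEW_SHOT_DEMOS = [
    ("I can't sleep, the deadlines are crushing me.", "yes"),
    ("Had a relaxing weekend hiking with friends.", "no"),
]

def build_few_shot(post: str) -> str:
    # Prepend labelled demonstrations, then append the unlabelled query.
    demos = "\n\n".join(
        ZERO_SHOT.format(post=p) + " " + label for p, label in FEW_SHOT_DEMOS
    )
    return demos + "\n\n" + ZERO_SHOT.format(post=post)

print(build_few_shot("Everything at work is piling up and I feel overwhelmed."))
```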


Discovering Significant Topics from Legal Decisions with Selective Inference

Soh, Jerrold

arXiv.org Artificial Intelligence

We propose and evaluate an automated pipeline for discovering significant topics from legal decision texts by passing features synthesized with topic models through penalised regressions and post-selection significance tests. The method identifies case topics significantly correlated with outcomes, topic-word distributions that can be manually interpreted to gain insights about significant topics, and case-topic weights that can be used to identify representative cases for each topic. We demonstrate the method on a new dataset of domain name disputes and on a canonical dataset of European Court of Human Rights violation cases. Topic models based on latent semantic analysis as well as on language model embeddings are evaluated. We show that the topics derived by the pipeline are consistent with legal doctrines in both areas and can be useful in other related legal analysis tasks.
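
A minimal sketch of the pipeline's shape, assuming scikit-learn and statsmodels: LSA topics are synthesised from TF-IDF features, an L1-penalised regression selects topics on one half of the data, and a refit on the held-out half yields p-values. Simple sample splitting stands in here for the paper's post-selection inference, and random synthetic documents stand in for decision texts:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for decision texts and binary case outcomes.
rng = np.random.default_rng(0)
vocab = [f"term{i}" for i in range(300)]
texts = [" ".join(rng.choice(vocab, size=60)) for _ in range(200)]
y = rng.integers(0, 2, size=200)

# 1. Synthesise topic features with an LSA topic model.
tfidf = TfidfVectorizer().fit_transform(texts)
topics = TruncatedSVD(n_components=20, random_state=0).fit_transform(tfidf)

# 2. Select outcome-correlated topics with a penalised regression (half the data).
half = len(y) // 2
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
lasso.fit(topics[:half], y[:half])
selected = np.flatnonzero(lasso.coef_[0])

# 3. Test the selected topics on the held-out half (sample splitting as a
# simple stand-in for the paper's post-selection significance tests).
X = sm.add_constant(topics[half:][:, selected])
fit = sm.Logit(y[half:], X).fit(disp=0)
print(dict(zip(selected, fit.pvalues[1:])))  # one p-value per selected topic
```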


Characterizing Mechanisms for Factual Recall in Language Models

Yu, Qinan, Merullo, Jack, Pavlick, Ellie

arXiv.org Artificial Intelligence

Language Models (LMs) often must integrate facts they memorized in pretraining with new information that appears in a given context. These two sources can disagree, causing competition within the model, and it is unclear how an LM will resolve the conflict. On a dataset that queries for knowledge of world capitals, we investigate both distributional and mechanistic determinants of LM behavior in such situations. Specifically, we measure the proportion of the time an LM will use a counterfactual prefix (e.g., "The capital of Poland is London") to overwrite what it learned in pretraining ("Warsaw"). On Pythia and GPT2, the training frequency of both the query country ("Poland") and the in-context city ("London") highly affect the models' likelihood of using the counterfactual. We then use head attribution to identify individual attention heads that either promote the memorized answer or the in-context answer in the logits. By scaling up or down the value vector of these heads, we can control the likelihood of using the in-context answer on new data. This method can increase the rate of generating the in-context answer to 88% of the time simply by scaling a single head at runtime. Our work contributes to a body of evidence showing that we can often localize model behaviors to specific components and provides a proof of concept for how future methods might control model behavior dynamically at runtime.
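
The runtime intervention can be sketched with a forward hook that rescales one head's slice of the value projection in GPT-2; the layer and head indices and the scale factor below are arbitrary illustrations, not the specific heads identified in the paper:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

LAYER, HEAD, SCALE = 9, 8, 2.0  # illustrative choices, not the paper's heads
head_dim = model.config.n_embd // model.config.n_head

def scale_value_head(module, inputs, output):
    # c_attn packs [query | key | value]; rescale head HEAD's value slice.
    start = 2 * model.config.n_embd + HEAD * head_dim
    output = output.clone()
    output[..., start : start + head_dim] *= SCALE
    return output

handle = model.transformer.h[LAYER].attn.c_attn.register_forward_hook(scale_value_head)

prompt = "The capital of Poland is London. The capital of Poland is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_id = model(ids).logits[0, -1].argmax().item()
print(tok.decode(next_id))  # scaling shifts mass between the two candidate answers
handle.remove()
```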


Explainable Data Poison Attacks on Human Emotion Evaluation Systems based on EEG Signals

Zhang, Zhibo, Umar, Sani, Hammadi, Ahmed Y. Al, Yoon, Sangyoung, Damiani, Ernesto, Ardagna, Claudio Agostino, Bena, Nicola, Yeun, Chan Yeob

arXiv.org Artificial Intelligence

The major aim of this paper is to explain, from the attacker's perspective, data poisoning attacks that use label flipping during the training stage of electroencephalogram (EEG) signal-based human emotion evaluation systems deploying machine learning models. Human emotion evaluation using EEG signals has consistently attracted a lot of research attention. The identification of human emotional states based on EEG signals is effective for detecting potential internal threats caused by insider individuals. Nevertheless, EEG signal-based human emotion evaluation systems have shown several vulnerabilities to data poisoning attacks. The findings of the experiments demonstrate that the suggested data poisoning attacks succeed independently of the model, although the various models exhibit varying levels of resilience to the attacks. In addition, the data poisoning attacks on EEG signal-based human emotion evaluation systems are explained with several Explainable Artificial Intelligence (XAI) methods, including Shapley Additive Explanations (SHAP) values, Local Interpretable Model-agnostic Explanations (LIME), and generated decision trees. The code for this paper is publicly available on GitHub.
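
A minimal illustration of the label-flipping mechanism, with synthetic features standing in for EEG-derived features: a fraction of the training labels is flipped and the resulting accuracy drop is measured on clean test data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features stand in for EEG-derived emotion features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for rate in (0.0, 0.1, 0.3):
    poisoned = y_tr.copy()
    idx = rng.choice(len(poisoned), size=int(rate * len(poisoned)), replace=False)
    poisoned[idx] = 1 - poisoned[idx]  # flip the selected training labels
    clf = RandomForestClassifier(random_state=0).fit(X_tr, poisoned)
    print(f"flip rate {rate:.0%}: clean test accuracy {clf.score(X_te, y_te):.3f}")
```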


Bachelor in AI and ML Engineering - European Leadership University

#artificialintelligence

European Leadership University (ELU) received its educational license in 2015 by official decree, with the approval of the Ministry of Education and Culture of the Turkish Republic of Northern Cyprus. We received institutional accreditation in 2016 and achieved full programme accreditation in 2017 from the Higher Education Planning, Evaluation, Accreditation and Coordination Council, Nicosia, which is a member of the European Association for Quality Assurance in Higher Education (ENQA), the umbrella organisation for recognised government accreditation agencies in the European Higher Education Area (EHEA).


GAP2WSS: A Genetic Algorithm based on the Pareto Principle for Web Service Selection

Khatoonabadi, SayedHassan, Lotfi, Shahriar, Isazadeh, Ayaz

arXiv.org Artificial Intelligence

This paper presents GAP2WSS, a genetic algorithm that adopts the Pareto principle for selecting a Web service for each task of a composite Web service from a pool of candidate Web services. In contrast to existing approaches, all global QoS constraints, interservice constraints, and transactional constraints are considered simultaneously. First, all candidate Web services are scored and ranked for each task using the proposed mechanism. Then, the top 20 percent of the candidate Web services of each task are retained as that task's candidates, reducing the problem search space. Finally, the Web service selection problem is solved over these remaining candidates using a genetic algorithm. Empirical studies demonstrate that this approach yields higher efficiency and efficacy than solving the problem over all candidate Web services.
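
The search-space reduction and GA loop can be sketched as follows; the random scores stand in for the paper's QoS-based scoring mechanism, and the additive fitness ignores the inter-service and transactional constraints handled by the full method:

```python
import random
random.seed(0)

N_TASKS, CANDIDATES_PER_TASK = 5, 50
# Placeholder candidate pools: (service name, QoS-style score) per task.
pools = [[(f"ws{t}_{i}", random.random()) for i in range(CANDIDATES_PER_TASK)]
         for t in range(N_TASKS)]

# Pareto-principle pruning: keep only the top 20% of candidates per task.
top = [sorted(pool, key=lambda ws: ws[1], reverse=True)[: CANDIDATES_PER_TASK // 5]
       for pool in pools]

def fitness(chrom):
    # Simplified fitness: sum of scores stands in for aggregated QoS.
    return sum(top[t][g][1] for t, g in enumerate(chrom))

# Simple GA over the reduced space: one gene (candidate index) per task.
pop = [[random.randrange(len(top[t])) for t in range(N_TASKS)] for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    children = []
    while len(children) < 20:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, N_TASKS)           # one-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.1:                    # random-reset mutation
            t = random.randrange(N_TASKS)
            child[t] = random.randrange(len(top[t]))
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print([top[t][g][0] for t, g in enumerate(best)], round(fitness(best), 3))
```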


Turkey deploys surveillance drone to northern Cyprus amid gas drilling dispute

The Japan Times

ANKARA – Turkey has dispatched a surveillance and reconnaissance drone to the breakaway north of the ethnically divided island nation of Cyprus amid tensions over offshore oil and gas exploration, Turkey's state-run media said Monday. The Anadolu news agency said the Turkish-made Bayraktar TB2 drone took off from an airbase in Dalaman, Turkey, and touched down Monday at the airport in Gecitkala -- known as Lefkoniko in Greek -- on Cyprus. Kudret Ozersay, foreign minister of the self-declared Turkish Cypriot state, told reporters Sunday that the Turkish deployment would be limited to unarmed drones as there was "no need" for armed ones. Earlier, Turkish Cypriot Prime Minister Ersin Tatar said there was an "urgent need" to address the security concerns of Turkey and the Turkish Cypriots in the eastern Mediterranean. It remains unclear what tasks the drones will specifically be assigned.


Multi-robot Symmetric Rendezvous Search on the Line with an Unknown Initial Distance

Ozsoyeller, Deniz

arXiv.org Artificial Intelligence

In this paper, we study the symmetric rendezvous search problem on the line with n > 2 robots that are unaware of their locations and of the initial distances between them. In the symmetric version of this problem, all robots execute the same strategy. The multi-robot symmetric rendezvous algorithm presented in this paper, MSR, is an extension of our symmetric rendezvous algorithm SR presented in [23]. We study both the synchronous and asynchronous cases of the problem; the asynchronous version of MSR is called MASR, and we allow robots to start executing MASR at different times. We perform a theoretical analysis of MSR and MASR and show that their competitive ratios are $O(n^{0.67})$ and $O(n^{1.5})$, respectively. Finally, we confirm our theoretical results through simulations.
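
For intuition about the setting, here is a toy simulation of a symmetric doubling strategy for two robots on the line; this illustrates the general idea only, not the n-robot MSR/MASR algorithms. In round k each robot flips a coin for a direction, walks 2^k from home and back, and the pair meets when both happen to walk toward each other with enough combined reach:

```python
import random
random.seed(1)

def rendezvous_distance(d: float) -> float:
    """Distance one robot travels before two robots at initial distance d meet."""
    travelled, k = 0.0, 0
    while True:
        a, b = random.choice((-1, 1)), random.choice((-1, 1))
        # They meet only when walking toward each other with combined
        # reach 2 * 2**k >= d, each covering d/2 before the encounter.
        if a == 1 and b == -1 and 2 ** (k + 1) >= d:
            return travelled + d / 2
        travelled += 2 ** (k + 1)  # out 2**k and back home
        k += 1

for d in (1, 10, 100):
    avg = sum(rendezvous_distance(d) for _ in range(2000)) / 2000
    print(f"d={d}: avg distance travelled {avg:.1f} ({avg / d:.1f}x the distance)")
```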